Abstract. In a network, identifying all vertices whose PageRank is more
than a given threshold value $\Delta$ is a basic problem that has arisen in Web
and social network analyses. In this paper, we develop a nearly optimal,
sublinear time, randomized algorithm for a close variant of this problem.
Given a directed network $G = (V,E)$, a threshold value $\Delta$, and
a positive constant $c > 3$, with probability $1 - o(1)$, our algorithm
returns a subset $S \subseteq V$ with the property that $S$ contains all vertices
of PageRank at least $\Delta$ and no vertex with PageRank less than $\Delta/c$.
The running time of our algorithm is always $\tilde{O}(n/\Delta)$. In addition, our
algorithm can be efficiently implemented in various network access models,
including the Jump and Crawl query model recently studied by [6],
making it suitable for dealing with large social and information networks.

As part of our analysis, we show that any algorithm for solving this problem
must have expected time complexity of $\Omega(n/\Delta)$. Thus, our algorithm
is optimal up to logarithmic factors. Our algorithm (for identifying vertices
with significant PageRank) applies a multi-scale sampling scheme
that uses a fast personalized PageRank estimator as its main subroutine.
For that, we develop a new local randomized algorithm for approximating
personalized PageRank which is more robust than the earlier ones
developed by Jeh and Widom [9] and by Andersen, Chung, and Lang [2].



1 Introduction 

A basic problem in network analysis is to identify the set of its vertices that 
are "significant." For example, the significant nodes in the web graph defined 
by a query could provide the authoritative content in web search; they could 
be the critical proteins in a protein interaction network; and they could be the 
set of people (in a social network) most effective to seed the influence for online 
advertising. As the networks become larger, we need more efficient algorithms 
to identify these "significant" nodes. 
1.1 Identifying Nodes with Significant PageRanks: Our Results

The meaning of 'significant' vertices depends on the semantics of the network and
the applications. In this paper, we focus on a particular measure of significance:
the PageRanks of the vertices. PageRank was introduced by Page and Brin
in their seminal work for ranking webpages [11]. Mathematically, the PageRank
(with restart constant, also known as the teleportation constant, $\alpha$) of a web-page
is proportional to the probability that the page is visited by a random surfer
who explores the web using the following simple random walk: at each step, with
probability $(1 - \alpha)$ go to a random webpage linked to from the current page,
and with probability $\alpha$, restart the process from a randomly chosen page. For
reasons that will become clear shortly, we consider a normalization of the PageRank so
that the sum of the PageRank values over all vertices is equal to $n$, the number
of vertices in the network. In other words, suppose PageRank($u$) denotes the
PageRank of vertex $u$ in the network $G = (V, E)$. Then,

$$\sum_{u \in V} \text{PageRank}(u) = n.$$

PageRank has been used by the Google search engine and has found applications
in a wide range of data analysis problems [4,7]. In this context, the problem
of identifying "significant" vertices could be illustrated by the following search
problem: Let Top PageRanks denote the problem of identifying all vertices
whose PageRanks in a network $G = (V, E)$ are more than a given threshold
value $1 \le \Delta \le |V|$.

In this paper, we consider the following close variant of Top PageRanks:

Significant PageRanks: Given a network $G = (V,E)$, a threshold
value $1 \le \Delta \le |V|$ and a positive constant $c > 1$, compute, with success
probability $1 - o(1)$, a subset $S \subseteq V$ with the property that $S$ contains all
vertices of PageRank at least $\Delta$ and no vertex with PageRank less than
$\Delta/c$.

We develop a nearly optimal, sublinear time randomized algorithm for
Significant PageRanks for any fixed $c > 3$. The running time of our algorithm is
always $\tilde{O}(n/\Delta)$. We show that any algorithm for Significant PageRanks must
have time complexity of $\Omega(n/\Delta)$. Thus, our algorithm is optimal up to logarithmic
factors. Our Significant PageRanks algorithm applies a multi-scale sampling
scheme that uses a fast personalized PageRank estimator (see below) as its main
subroutine.

1.2 Matrix Sampling and Personalized PageRank Approximation

While the PageRank of a vertex captures the importance of the vertex collectively
assigned by all vertices in the network, as pointed out by Haveliwala [5],
one can use the distributions of the following random walk to define the pairwise
contributions of significances: Given a teleportation probability $\alpha$ and a starting
vertex $u$ in a network $G = (V,E)$, at each step, with probability $(1 - \alpha)$ go
to a random neighboring vertex, and with probability $\alpha$, restart the process
from $u$. For $v \in V$, the probability that $v$ is visited by this random process,
denoted by PersonalizedPageRank$_u(v)$, is $u$'s personal PageRank contribution
of significance to $v$. It is not hard to verify that

$$\forall u \in V, \quad \sum_{v \in V} \text{PersonalizedPageRank}_u(v) = 1; \text{ and}$$

$$\forall v \in V, \quad \text{PageRank}(v) = \sum_{u \in V} \text{PersonalizedPageRank}_u(v).$$
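The two identities above can be checked numerically on a small graph. The following is a minimal sketch, not the paper's algorithm: it computes every personalized PageRank vector by summing the series $\alpha \sum_i ((1-\alpha) D^{-1}A)^i$ up to a fixed number of terms. The function names, the 4-node example graph, and the truncation length are assumptions made for illustration.

```python
def ppr_matrix(out_nbrs, alpha=0.15, iters=300):
    """Rows of alpha * sum_i ((1 - alpha) * D^-1 A)^i, truncated after `iters` terms."""
    n = len(out_nbrs)
    matrix = []
    for u in range(n):
        row = [0.0] * n          # PersonalizedPageRank_u(.)
        walk = [0.0] * n         # distribution of a surviving walk started at u
        walk[u] = 1.0
        for _ in range(iters):
            nxt = [0.0] * n
            for x, mass in enumerate(walk):
                if mass == 0.0:
                    continue
                row[x] += alpha * mass                      # restart happens at x
                share = (1.0 - alpha) * mass / len(out_nbrs[x])
                for y in out_nbrs[x]:                       # follow a random out-link
                    nxt[y] += share
            walk = nxt
        matrix.append(row)
    return matrix


def pagerank_from_ppr(matrix):
    """Column sums of the PPR matrix: PageRank(v) = sum_u PersonalizedPageRank_u(v)."""
    n = len(matrix)
    return [sum(matrix[u][v] for u in range(n)) for v in range(n)]


# Hypothetical 4-node example graph: out_nbrs[x] lists the out-neighbors of x.
out_nbrs = [[1, 2], [2], [0], [0, 2]]
ppr = ppr_matrix(out_nbrs)
pagerank = pagerank_from_ppr(ppr)
row_sums = [sum(row) for row in ppr]
```

Each entry of `row_sums` should be 1 up to truncation error, and the entries of `pagerank` should add up to the number of nodes, matching the normalization used in this paper.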

Personalized PageRanks have been widely used to describe personalized
behavior of web-users [11] as well as for designing good network clustering
techniques [5]. As a result, fast algorithms for computing or approximating
personalized PageRank can be very useful. One can approximate PageRanks and
personalized PageRanks by the power method [3], which involves costly matrix
multiplications for large scale networks. Applying effective truncation, Jeh and
Widom [9] and Andersen, Chung, and Lang [2] developed personalized PageRank
approximation algorithms that can find an $\epsilon$-additive approximation in time
proportional to the product of $\epsilon^{-1}$ and the maximum in-degree in the graph.

Our sublinear-time algorithm for Significant PageRanks also requires
fast subroutines for estimating personalized PageRanks. It uses a multi-scale
sampling approach by selecting a set of precision parameters $\{\epsilon_1, \ldots, \epsilon_h\}$ where $h$
depends on $n$ and $\Delta$, and $\epsilon_i = 1/2^i$. Then, for each $i$ in the range $1 \le i \le h$, it computes
the $\epsilon_i$-precise personalized PageRanks defined by a sample of $\tilde{O}(\epsilon_i n/\Delta)$ vertices.
For networks with constant maximum degrees, we can simply use the Jeh-Widom
or Andersen-Chung-Lang personalized PageRank approximation algorithms in
our multi-scale sampling scheme. However, for networks such as web graphs
and social networks that may have nodes with large degrees, these two earlier
algorithms are not robust enough for our purpose.

We develop a new local algorithm for approximating personalized PageRank
that satisfies the robustness property that the multi-scale sampling scheme
requires. Given $\rho, \epsilon > 0$ and a starting vertex $u$ in a network $G = (V,E)$, our
algorithm estimates each entry in the personalized PageRank vector,

PersonalizedPageRank$_u(\cdot)$

defined by $u$ to a multiplicative factor of at most $(1+\rho)$ plus an additive precision
error of at most $\epsilon$ (see footnote 4). The time complexity of our algorithm is
$O\left(\frac{\log(|V|)\log(\epsilon^{-1})}{\epsilon\rho^2}\right)$.
Our algorithm requires a careful simulation of random walks from the starting
node $u$ to ensure that its complexity does not depend on the degree of any node.

Our algorithms can be efficiently implemented in various network querying
models assuming no direct global access to the network. In particular, our
algorithms can be efficiently implemented in the Jump and Crawl query model [6],
making the algorithm suitable for processing large social and information
networks.

4 Formally, the estimated value $\widehat{val}$ of $val$ would have the property that
$(1 - \rho) \cdot val - \epsilon \le \widehat{val} \le (1 + \rho) \cdot val + \epsilon$.

In particular, our sublinear algorithm for Significant PageRanks could 
be used in Web search engines, which often need to build a core of web-pages 
to be later used for web-search. It is desirable that pages in the core have high 
PageRank values. These search engines usually apply crawling to discover new 
significant pages and add them to the core. The property that our sublinear-time
algorithms have a natural implementation in the Jump and Crawl model
may make them useful in a search engine for selecting pages with high PageRank 
values to update the current core by using them to replace the existing core pages 
that have relatively low PageRank values. We anticipate that our algorithm for 
Significant PageRanks will be useful for many other network analysis tasks. 

1.3 Additional Related Work 

For personalized PageRanks approximation, in addition to the work of
[2,4,5,8,11], Andersen et al. [1] developed a 'backward' version of the local algorithm
of [5]. Their algorithm finds all nodes that contribute at least some fixed fraction
$\rho$ to a page's PageRank in time $O(d_{\text{max-out}}/\rho)$ where $d_{\text{max-out}}$ is the maximum
out-degree in the network. This algorithm can be used to provide some reliable
estimate to a node's PageRank. For example, for a given $k$, in time $O(k)$ it
can bound the total contribution from the $k$ highest contributors to the node's
PageRank. However, for networks with large $d_{\text{max-out}}$, its complexity may not
be sublinear.

As suggested in [1], one can view the entire set of personalized PageRanks
(defined by all vertices in a network) as a $|V| \times |V|$ matrix, which is referred to as
the PageRank matrix of the graph. In the PageRank matrix, each row represents
the personalized PageRanks from a particular vertex, and each column represents
the contributions to its PageRank from all vertices in the network. Note that
the sum of each row is 1 and the sum of the $u$-th column is the PageRank of $u$.

In light of this, the problem of Significant PageRanks can be viewed 
as a matrix sparsification or matrix approximation problem. There has been a 
large body of work of finding a low complexity approximation to a matrix that 
preserves some of its properties. Perhaps the most relevant one to our goal is a 
low rank matrix approximation under the $\ell_1$ matrix norm.

All current methods for finding such low rank approximations run in time
at least linear in the size of the input matrix. See [10] for a survey of recent
results.

Next, a linear time Monte Carlo based method to estimate the PageRank of all
nodes is devised in [3]. The method is based on running a constant number of
random walks from each of the nodes in the network.

Last, in the context of sublinear time graph algorithms, our research is
related to the work of [12], in which sublinear time algorithms are presented for
estimating several quantities. Our implementation of the Jump and Crawl query
model can be viewed as a stringent type of the adjacency-list graph model used
in [12].
1.4 Organization

Section 2 contains the needed definitions and notations. Section 3 presents our
multi-scale sampling algorithm for Significant PageRanks. In Section 4 we
provide a lower bound construction for Significant PageRanks. In Section
5 we give a robust local algorithm for approximating personalized PageRank
vectors.

2 Preliminaries 

We consider a network which is defined as a directed graph $G = (V, E)$ with
$n$ nodes and $m$ edges. Usually, a network is massive. Our algorithms access a
network using a rather natural implementation of the Jump-and-Crawl query
model of [6] developed for processing large social and information networks. The
Jump and Crawl model is concerned with informational complexity of nodes,
where each node access reveals its full list of adjacent neighbors at no extra
cost. Our algorithms shall be designed to work under the following compelling
implementation of the Jump-and-Crawl query model. We allow two types of
queries:

- Jump: A call to the Jump query needs no input and returns a node chosen
uniformly at random from the network.

- RandomCrawl: A call to the RandomCrawl query requires a vertex $v$ as
input. RandomCrawl($v$) returns an out-neighbor of $v$ chosen uniformly at random.

Note for example, that the random surfer procedure used in the definition of 
PageRank is itself a natural algorithm under our implementation of the Jump- 
and-Crawl query model. 
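To make the query model concrete, here is a minimal sketch of an oracle that exposes a graph only through the two queries, together with the random surfer written against it. The class name `JumpCrawlOracle`, the adjacency-list representation, and the query counter are assumptions made for illustration; they are not part of the model's definition.

```python
import random


class JumpCrawlOracle:
    """Hypothetical wrapper exposing a graph only through Jump and RandomCrawl."""

    def __init__(self, out_nbrs, seed=None):
        self._out_nbrs = out_nbrs          # out_nbrs[v] = list of out-neighbors of v
        self._rng = random.Random(seed)
        self.queries = 0                   # total number of queries issued so far

    def jump(self):
        """Jump: return a node chosen uniformly at random from the network."""
        self.queries += 1
        return self._rng.randrange(len(self._out_nbrs))

    def random_crawl(self, v):
        """RandomCrawl(v): return an out-neighbor of v chosen uniformly at random."""
        self.queries += 1
        return self._rng.choice(self._out_nbrs[v])


def random_surfer(oracle, steps, alpha=0.15, seed=None):
    """Visit counts of the random surfer, driven only by oracle queries."""
    coin = random.Random(seed)
    visits = {}
    node = oracle.jump()                   # start from a uniformly random page
    for _ in range(steps):
        visits[node] = visits.get(node, 0) + 1
        if coin.random() < alpha:
            node = oracle.jump()           # teleport to a random page
        else:
            node = oracle.random_crawl(node)
    return visits


oracle = JumpCrawlOracle([[1, 2], [2], [0], [0, 2]], seed=1)
visits = random_surfer(oracle, steps=1000, seed=2)
```

Counting queries in the oracle makes it easy to compare an algorithm's query cost against bounds stated in terms of $n/\Delta$.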

We now move to define personalized PageRank as well as PageRank.
Mathematically, the personalized PageRank vector of a node $v$ is the stationary point
of the following equation:

$$\text{PersonalizedPageRank}_v(\cdot) = \alpha 1_v + (1 - \alpha)\,\text{PersonalizedPageRank}_v(\cdot) \cdot D^{-1}A,$$

where $\alpha$ is the teleportation probability, $A$ is the adjacency matrix of the directed
network $G = (V, E)$ so $A(i,j) = 1$ iff $(i,j) \in E$. In this notation, $D$ is a diagonal
matrix with $d_{\text{out}}(v)$ at entry $(v,v)$ and $1_v$ is the indicator vector of $v$. We will
follow the standard [4] by assuming that each node has at least one out-link (see footnote 5).
Then, one can define the PageRank vector as

$$\text{PageRank}(\cdot) = \sum_{v \in V} \text{PersonalizedPageRank}_v(\cdot).$$

Note that in this definition, the sum of all the PageRank values is equal to $n$.

Following [1], we define a matrix PPR (short for personalized PageRank) to
be the $n \times n$ matrix whose $v$-th row is PersonalizedPageRank$_v(\cdot)$.

Unless stated otherwise, for any $x$, $\log(x)$ would mean $\log_2(x)$.

5 Otherwise, as commonly done [J, consider that node as having out links into all
nodes in the network.
3 Multi-scale Matrix Sampling and Approximation of PageRank

In this section, we present our nearly optimal, sublinear time algorithm for
Significant PageRanks. Recall that

Significant PageRanks: Given a network $G = (V,E)$, a threshold
value $1 \le \Delta \le |V|$ and a positive constant $c > 1$, compute, with probability
$1 - o(1)$, a subset $S \subseteq V$ with the property that $S$ contains all vertices
of PageRank at least $\Delta$ and no vertex with PageRank less than $\Delta/c$.

Note that the PageRank value of each vertex is at least $\alpha$ and at most $n$.
Instrumental to our algorithm, we present a multi-scale algorithm for sampling
the PageRank matrix PPR that achieves, for any fixed $c > 3$, the following goals:
The algorithm makes $\tilde{O}(n/\Delta)$ total queries and updates, and with high probability,

1. For each vertex with PageRank value at least $\Delta$, the sum of the sampled
entries of the column corresponding to the vertex will provide a quality
estimate to the PageRank value of that vertex.

2. The algorithm does not return any vertex whose PageRank value is less than
$\Delta/c$.

In our algorithm, we will use a new local algorithm ApproxRow for
personalized PageRank approximation. Algorithm ApproxRow takes three input
parameters: $v \in V$, an additive error factor $\epsilon \in (0, 1)$ and a multiplicative
factor $\rho \in (0,1)$. It returns an approximation to PersonalizedPageRank$_v(\cdot)$ such
that for every PPR$(v, j) \ge \epsilon$, it returns a non-negative estimated value between
$(1 - \rho)\text{PPR}(v, j) - \epsilon$ and $(1 + \rho)\text{PPR}(v, j) + \epsilon$. The running time of ApproxRow
is essentially $O\left(\frac{\log(n)\log(\epsilon^{-1})}{\epsilon\rho^2}\right)$. ApproxRow and its analysis will be presented in
Section 5.

We start with the high-level idea of our multi-scale sampling algorithm. To
assist our exposition, we will present our algorithm and its analysis for $c = 6$.
Both are easily extended to any other constant value $c > 3$. Our algorithm will
use $O(\log n)$ precision scales: $\epsilon_t = 2^{-t}$ for $0 \le t \le \log(n/\Delta)$. We conceptually
divide each column of the PPR matrix into chunks, where the chunk corresponding
to $\epsilon_t$ contains its entries with values between $\epsilon_t$ and $2\epsilon_t$. Thus, we ignore all entries
in the PPR matrix column of value less than the finest scale. Note that
entries with value at most $\Delta/(4n)$ can contribute at most a quarter to the PageRank
of a vertex whose PageRank value is at least $\Delta$.

If the sum of a chunk's entries is at least $\Delta/(2\log(n))$, we will refer to it
as a heavy chunk. The central idea of our algorithm is to efficiently generate
robust estimates of the sums for all heavy chunks, as we shall show that it is
also sufficient to only provide estimates to heavy chunks.
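The chunk decomposition can be illustrated on a single column of the PPR matrix. The sketch below assigns each entry $p$ to the scale $t$ with $2^{-t} \le p < 2^{1-t}$, drops entries below the finest scale, and flags chunks whose sum reaches $\Delta/(2\log(n))$. The helper names and the example column are hypothetical, and the column length is taken to be $n$.

```python
import math


def chunk_scale(p):
    """Scale index t with 2**-t <= p < 2**(1 - t), for 0 < p <= 1."""
    _, exponent = math.frexp(p)            # p = m * 2**exponent with 0.5 <= m < 1
    return 1 - exponent


def heavy_chunks(column, delta):
    """Return (chunk sums by scale, set of heavy scales) for one PPR column."""
    n = len(column)
    finest = int(math.log2(n / delta))     # scales t = 0, ..., log(n / delta)
    sums = {}
    for p in column:
        if p <= 0.0:
            continue
        t = chunk_scale(p)
        if t <= finest:                    # entries below the finest scale are ignored
            sums[t] = sums.get(t, 0.0) + p
    threshold = delta / (2.0 * math.log2(n))
    heavy = {t for t, s in sums.items() if s >= threshold}
    return sums, heavy


# Hypothetical column of length n = 64: forty entries of 0.3, one of 0.6, the rest tiny.
column = [0.3] * 40 + [0.6] + [0.001] * 23
sums, heavy = heavy_chunks(column, delta=8)
```

In this example only the chunk at scale $t = 2$ (entries in $[0.25, 0.5)$) is heavy: its sum is about 12, while the single entry 0.6 at scale $t = 1$ stays below the threshold $8/12$.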

As the entries in each chunk are within a factor of 2 of each other, we then
reduce the task of estimating the sum in a chunk to the problem of approximately
counting the size of the chunk. Then conceptually, we estimate the size of each
heavy chunk at scale $\epsilon_t$ by taking $\tilde{O}(\epsilon_t n/\Delta)$ random entries from its column
and counting the number of samples in this chunk. The challenge we need to
overcome is to efficiently sample all heavy chunks at a scale simultaneously.

This is where we will use our local PageRank approximation algorithm
ApproxRow, which in $O(\log(n)\log(\epsilon^{-1})/\epsilon)$ time, when given a vertex $v$, returns robust
estimates to all entries of values at least $\epsilon$ in $v$'s row in the PPR matrix. To
achieve $\tilde{O}(n/\Delta)$ queries and running time, we call ApproxRow $\tilde{O}(n\epsilon_t/\Delta)$ times at
scale $\epsilon_t$, and we will show that it is sufficient to sample this much (or little).

In the last step of the algorithm, for each node $j$, we will simply sum up over
all scales $\epsilon_t$ its estimated values weighted by a normalizing factor $\frac{\Delta}{2\epsilon_t\log^2(n)}$.

Then the algorithm will output only those $j$'s where the sum is at least $\Delta/4$ and
their estimated PageRank values.

A detailed pseudo-code of our algorithm, ApproximatePageRank, is given 
below. 



Algorithm 1 ApproximatePageRank

Require: PageRank threshold $\Delta$, a network $G = (V, E)$ on $n$ nodes accessible only by
Jump and RandomCrawl queries.
// First-Part //
1: Initialize a binary search tree, ChunkTree, indexed lexicographically by a two-tuple
   key (nodeID, $\epsilon$).
2: for $t = 0$ to $\log(n/\Delta)$ do
3:    Set the additive error $\epsilon_t = 2^{-t}$.
4:    for $\frac{n\epsilon_t}{\Delta} \cdot 4\log^2(n)$ times do
5:       Jump to a random node, call it $v$.
6:       Call list = ApproxRow($v$, $\epsilon_t/2$, $1/2$) and update the chunk size estimates of the
         vertices in list as follows:
7:       for each pair (nodeID, $\epsilon_t$) in the list do
8:          if there exists an entry $e$ with key (nodeID, $\epsilon_t$) in ChunkTree then
9:             Update entry $e$'s value by adding 1 to its current value.
10:         else
11:            Create an entry in ChunkTree with key (nodeID, $\epsilon_t$) and value 1.
12:         end if
13:      end for
14:   end for (at scale $\epsilon_t$)
15: end for (for all scales)
// Second-Part //
16: Initialize a final tree, called TreeofPageRankValues, indexed by key (nodeID).
17: for all elements (chunks) in ChunkTree that belong to the same node $i$ (namely,
    have $i$ as the first part of their key) do
18:    if chunk has value, val, at least $\frac{1}{2}\log(n)$ then
19:       Let $\epsilon$ be the second part of the chunk's key.
20:       Add $\epsilon \cdot val \cdot \frac{\Delta}{2\epsilon\log^2(n)}$ to the entry indexed by ($i$) in TreeofPageRankValues.
21:    end if
22: end for
23: Output all elements in TreeofPageRankValues with value at least $\Delta/4$.
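The two parts of the algorithm can be traced in a short simulation. The sketch below is an illustration under stated assumptions rather than the paper's implementation: it replaces ApproxRow by an exact row computation (the perfect row approximation assumption used later in the analysis), stores chunk counts in a dictionary instead of a binary search tree, and assumes that each count is rescaled by $\Delta/(2\log^2(n))$ when summed. The function names and the test graph are hypothetical.

```python
import math
import random
from collections import defaultdict


def exact_row(out_nbrs, v, alpha, iters=200):
    """Exact stand-in for ApproxRow: row v of the PPR matrix via its power series."""
    n = len(out_nbrs)
    row, walk = [0.0] * n, [0.0] * n
    walk[v] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for x, mass in enumerate(walk):
            if mass > 0.0:
                row[x] += alpha * mass
                share = (1.0 - alpha) * mass / len(out_nbrs[x])
                for y in out_nbrs[x]:
                    nxt[y] += share
        walk = nxt
    return row


def approximate_pagerank(out_nbrs, delta, alpha=0.15, seed=0):
    rng = random.Random(seed)
    n = len(out_nbrs)
    log_n = math.log2(n)
    rows = {}                                   # cache of rows already computed
    chunk_tree = defaultdict(int)               # key (node, t) -> sample count
    # First part: sample rows at every scale and count chunk memberships.
    for t in range(int(math.log2(n / delta)) + 1):
        eps = 2.0 ** (-t)
        for _ in range(math.ceil(n * eps / delta * 4 * log_n ** 2)):
            v = rng.randrange(n)                # Jump query
            if v not in rows:
                rows[v] = exact_row(out_nbrs, v, alpha)
            for j, p in enumerate(rows[v]):
                if eps <= p < 2 * eps:
                    chunk_tree[(j, t)] += 1
    # Second part: turn counts of well-sampled chunks into PageRank estimates.
    estimates = defaultdict(float)
    for (j, t), val in chunk_tree.items():
        if val >= 0.5 * log_n:
            # Assumed normalization: eps * val * delta / (2 * eps * log^2 n).
            estimates[j] += val * delta / (2 * log_n ** 2)
    return {j: est for j, est in estimates.items() if est >= delta / 4}


# Hypothetical test graph: 63 leaves point to hub 0, and the hub points to leaf 1.
star = [[1]] + [[0] for _ in range(63)]
result = approximate_pagerank(star, delta=16)
```

On this graph the hub and leaf 1 have PageRank of roughly 29.5 and 25.2, while every other leaf has PageRank 0.15, so with $\Delta = 16$ only nodes 0 and 1 should be reported.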
In the proofs for the following two theorems, we will analyze the performance
of this algorithm. Note that we will ignore the dependence of the running time on
$\alpha$ as for all standard PageRank computations, it is taken to be a fixed constant
independent of input size [4].

Theorem 1 (Complexity of ApproximatePageRank). The runtime of
algorithm ApproximatePageRank is upper bounded by $\tilde{O}(n/\Delta)$.

Proof The algorithm uses $O(\log(n/\Delta))$ scales. In First-Part of the algorithm,
for scale $\epsilon_t$, it makes $\frac{n\epsilon_t}{\Delta} \cdot 4\log^2(n)$ Jump queries and for each query it runs
ApproxRow($v$, $\epsilon_t/2$, $1/2$), where $v$ represents the random vertex returned by the
query. ApproxRow then has a runtime of $O\left(\frac{\log(n)\log(\epsilon_t^{-1})}{\epsilon_t}\right)$. Thus, the total
runtime complexity is $\tilde{O}(n/\Delta)$ as the finest scale is $\Delta/n$ and there are at most $\log n$
scales. In addition to the time spent on querying the network, the algorithm
takes $O(\log(n))$ per step overhead for each access/update in its data structure.

In Second-Part of the algorithm, it makes no new queries. As there are only
$\tilde{O}(n/\Delta)$ items in the data structure ChunkTree and then TreeofPageRankValues,
the complexity of this summation part is $\tilde{O}(n/\Delta)$. The last step of outputting
all nodes in the tree with value bigger than a threshold can easily be done in
linear time in the size of the tree, which is $\tilde{O}(n/\Delta)$. □
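As a rough check of this bound, the per-scale costs quoted in the proof can be summed explicitly. The derivation below is a sketch that treats $\alpha$ and $\rho = 1/2$ as constants and uses $\log(\epsilon_t^{-1}) = t$ for $\epsilon_t = 2^{-t}$; the hidden constants are assumptions and are not taken from the paper.

```latex
\begin{align*}
\sum_{t=0}^{\log(n/\Delta)}
  \underbrace{\frac{n\epsilon_t}{\Delta}\,4\log^2(n)}_{\text{Jump queries at scale } t}
  \cdot
  \underbrace{O\!\left(\frac{\log(n)\log(\epsilon_t^{-1})}{\epsilon_t}\right)}_{\text{one ApproxRow call}}
&= O\!\left(\frac{n}{\Delta}\log^3(n)\sum_{t=0}^{\log(n/\Delta)} t\right) \\
&= O\!\left(\frac{n}{\Delta}\log^3(n)\log^2(n/\Delta)\right)
 = \tilde{O}\!\left(\frac{n}{\Delta}\right).
\end{align*}
```

The factor $\epsilon_t$ in the number of samples cancels the $1/\epsilon_t$ in the cost of each ApproxRow call, which is why every scale contributes about the same amount of work up to logarithmic factors.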

Theorem 2 (Correctness of ApproximatePageRank). Given $\Delta$ and
constant $c > 3$, ApproximatePageRank outputs, with probability $1 - o(1)$, all nodes
with PageRank at least $\Delta$ but no node with PageRank smaller than $\Delta/c$.

Proof For $v \in V$, let $(p_1^v, p_2^v, \ldots, p_n^v)$ be $v$'s column in the PPR matrix. Let
ChunkSet$(v, \epsilon) = \{i : \epsilon \le p_i^v < 2\epsilon\}$, ChunkSize$(v, \epsilon) = |\text{ChunkSet}(v, \epsilon)|$, and
ChunkSum$(v, \epsilon) = \sum_{i=1}^{n} \{p_i^v : \epsilon \le p_i^v < 2\epsilon\}$.

Recall that a chunk is heavy if its ChunkSum is at least $\Delta/(2\log(n))$. We now prove that at the
end of First-Part in Algorithm ApproximatePageRank, all heavy chunks are well
approximated.

To focus on the essence of the proof for multi-scale matrix sampling, we first
assume that all the values returned by ApproxRow are exact (with no error at
all). We call this assumption the perfect row approximation assumption. We will
later show that when removing this assumption the approximation scheme would
only be affected by a multiplicative factor of three, namely the effective value of
$c$ in Significant PageRanks would be one third its value under perfect row
approximations.

Lemma 1 (key lemma). Let $\epsilon_t = 2^{-t}$, for $1 \le t \le \log(n/\Delta)$. The following holds
with probability $1 - o(1)$:

- If ChunkSum$(v, \epsilon_t) \ge \frac{\Delta}{2\log(n)}$ then at the end of First-Part in the algorithm
the entry in ChunkTree with key $(v, \epsilon_t)$, namely the algorithm's approximation
of ChunkSize$(v, \epsilon_t)$, is at least $\log(n)/2$ and is between
$\frac{\text{ChunkSize}(v, \epsilon_t)}{\Delta} \cdot \epsilon_t \cdot 2\log^2(n)$ and $\frac{\text{ChunkSize}(v, \epsilon_t)}{\Delta} \cdot \epsilon_t \cdot 8\log^2(n)$.

- If ChunkSum$(v, \epsilon_t) < \frac{\Delta}{4\log(n)}$ then at the end of First-Part in the algorithm
the entry in ChunkTree with key $(v, \epsilon_t)$, namely the algorithm's approximation
of ChunkSize$(v, \epsilon_t)$, is smaller than $\frac{\log(n)}{2}$.

Again for exposition, we present our algorithm and its analysis for $c = 6$. We later
show that the theorem on a slightly modified algorithm holds for any constant $c > 3$.

Proof Note that $\frac{1}{2\epsilon_t}\text{ChunkSum}(v, \epsilon_t) \le \text{ChunkSize}(v, \epsilon_t) \le \frac{1}{\epsilon_t}\text{ChunkSum}(v, \epsilon_t)$.
So, if ChunkSum$(v, \epsilon_t) \ge \frac{\Delta}{2\log(n)}$ then ChunkSize$(v, \epsilon_t)/n \ge \Delta/(4\epsilon_t n\log n)$.

Thus, when sampling $4\epsilon_t n\log^2(n)/\Delta$ random rows (as in line 5 of the
algorithm), the expected number of entries in the chunk that ApproxRow discovers
is at least ChunkSize$(v, \epsilon_t) \cdot \epsilon_t \cdot 4\log^2(n)/\Delta \ge \log(n)$. By a standard multiplicative
Chernoff bound (see appendix), with probability $1 - o(1)$, after multiplying the
count by $\Delta/(2\epsilon_t\log^2 n)$, we can approximate ChunkSize$(v, \epsilon_t)$ within a
multiplicative factor of 2. Moreover, if ChunkSum$(v, \epsilon_t) < \frac{\Delta}{4\log(n)}$ then its estimated
value is at most twice its value, namely smaller than $\frac{\log(n)}{2}$. □

Lemma 2. The following holds with probability $1 - o(1)$ under the perfect row
approximation assumption:

- If PageRank$(v) \ge \Delta$, then the algorithm will output $v$ and will estimate its
PageRank value to a value between PageRank$(v)/4$ and $2\,$PageRank$(v)$.

- If PageRank$(v) < \Delta/8$, then the algorithm will not output $v$.

- If $\Delta/8 \le$ PageRank$(v) < \Delta$, then the algorithm might output $v$. If $v$ is
outputted, then its estimated PageRank value is between PageRank$(v)/16$ and
$2\,$PageRank$(v)$.

Proof By Lemma 1, the sum of each heavy chunk is well estimated, to
within a multiplicative factor of 2.

Since there are at most $\log n$ chunks in a column, the contribution from all
non-heavy chunks is at most $\log n \cdot (\Delta/(2\log n)) = \Delta/2$. Thus, if $v$'s PageRank
is at least $\Delta$, then the contribution from its heavy chunks is at least $\Delta/2$.
Consequently, the algorithm's approximation to $v$'s PageRank will be at least
$\Delta/4$ and at most $2\Delta$, and this vertex will be outputted.

We can similarly establish the other two cases as stated in the lemma. □

We now turn to discuss the effect of having only approximate values
computed in ApproxRow calls on the guarantees of ApproximatePageRank.

Lemma 3. Given parameters $0 < \epsilon, \rho < 1$, removing the perfect row
approximation assumption changes the approximation constant $c$ by at most 3 times its
value as well as changes the estimated PageRank values computed by
ApproximatePageRank to be at most three times their value.

Proof The PPR matrix is effectively computed using calls to ApproxRow by
the algorithm.

Given $\epsilon > 0$, consider an element $\epsilon \le \text{PPR}(v,j) \le 2\epsilon$ for some nodes $v, j$.
There are two sources for having this element's estimator differ from its real value.
First, ApproxRow (with additive parameter $\epsilon/2$ and $\rho = 1/2$) computes approximate
values, so the estimated value is between $(1 - \rho - 1/4)$ times its real value and
$(1 + \rho + 1/4)$ times its real value (we could put the additive $\epsilon/4$ error in the
multiplicative approximation factor since $\epsilon \le \text{PPR}(v, j) \le 2\epsilon$). In particular, in the
algorithm we pick $\rho = 1/2$. However, one could replace both $\rho$ and $\epsilon$ with smaller
values to get an approximation that gets closer to the true value: Replacing $\rho$ by $k_1\rho$
and $\epsilon$ by $k_2\epsilon$ for any $0 < k_1, k_2 \le 1$ would increase the total runtime by only a factor of
$\frac{1}{k_1^2 k_2} \cdot \frac{\log(\epsilon^{-1}k_2^{-1})}{\log(\epsilon^{-1})}$. The second source why the estimator differs from its true value is
double counting. An element with a true value between $\epsilon/2$ and $\epsilon$ as well as one
with a true value between $2\epsilon$ and $4\epsilon$ could appear in a realization as an element
with a value between $\epsilon$ and $2\epsilon$. However (by applying the Chernoff bound), elements
with true value smaller than $\epsilon/2$ as well as those with value bigger than $4\epsilon$ would
not appear as such. Thus due to double counting the sum of elements in each
column can be at most three times its real value. If we denote the PageRank of
node $j$ by $\Delta(j)$ and the value it gets from the realized column values by $\Delta'(j)$
then,

$$(1 - k_1\rho - k_2/2)\,\Delta(j) \le \Delta'(j) \le 3(1 + k_1\rho + k_2/2)\,\Delta(j).$$

In particular, algorithm ApproximatePageRank uses $\rho = \frac{1}{2}$, $k_1 = 1$ and $k_2 = \frac{1}{2}$
which gives

$$\frac{1}{4}\Delta(j) \le \Delta'(j) \le 6\Delta(j).$$

□

This ends the proof of Theorem 2. □



4 Lower Bound Construction for PageRank Approximations

We now turn to prove a corresponding lower bound for PageRank
approximations.

The lower bound construction will show that any algorithm making fewer
than $\Omega(n/\Delta)$ Jump and Crawl queries will fail, with constant probability, to find
any node with PageRank at least $\Delta$ on the graph. This holds true for any type
of implementation of a Crawl query (including the RandomCrawl one). Given
positive integers $n$ and $\Delta < \frac{n}{2} - 1$, we construct an undirected graph on $n$ nodes
made of a path subgraph on $n - d - 1$ nodes and an isolated star subgraph on
$d + 1$ nodes, where $d = 2\Delta$. See Figure 1 for an illustration. Fix $0 < \alpha < 1$, the
teleportation probability. By solving the PageRank equations it is not hard to
check that each node on the path subgraph has PageRank value of 1, the hub
of the subgraph has PageRank | + 2(i-a) anc ^ eacn l ea f °f the star subgraph 

has PageRank of \{d-\- 1 — f — 2 (i-a) ) — I + 2- ^ s ^ ~ 3> tQe on ^ n °de with 
PageRank at least $\Delta$ is the hub of the star subgraph. However, for any $\epsilon > 0$,
in order to find any node that belongs to the star subgraph one needs to make,
with probability at least 1 — ^ — e, at least § = ^(3) Jump queries. 
Fig. 1. An example illustrating the path-star graph of the lower bound construction
for PageRank computations.
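The counting argument behind the lower bound can be made concrete with a small calculation. The sketch below assumes, as in the construction, that the star has $2\Delta + 1$ nodes and is isolated from the path, so it can only be reached by a Jump query that lands on one of its nodes; the function names and the numbers in the example are hypothetical.

```python
import math


def miss_probability(n, delta, jumps):
    """Probability that `jumps` uniform Jump queries all miss the star's 2*delta + 1 nodes."""
    star_nodes = 2 * delta + 1
    return (1.0 - star_nodes / n) ** jumps


def jumps_needed(n, delta, success=0.5):
    """Smallest number of Jump queries whose miss probability is at most 1 - success."""
    star_nodes = 2 * delta + 1
    return math.ceil(math.log(1.0 - success) / math.log(1.0 - star_nodes / n))


# Example with n = 1000 and delta = 10: the star has 21 of the 1000 nodes.
needed = jumps_needed(1000, 10)
```

For these values `needed` is 33, which is on the order of $n/\Delta = 100$ up to a constant factor, in line with the $\Omega(n/\Delta)$ bound.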



5 Local Robust Computation of Personalized PageRank 

We now describe a method, ApproxRow, based only on local computations, that
approximates a node's personalized PageRank vector. The pseudo-code is given
in Algorithm 2.

Algorithm 2 ApproxRow

Require: A node $v$ in $G = (V, E)$, additive error parameter $0 < \epsilon < 1$, multiplicative
approximation parameter $0 < \rho < 1$, teleportation probability $0 < \alpha < 1$.
1: Initialize a binary search tree NodeCountTree where the key is a node's identity.
2: Set length = $\log_{\frac{1}{1-\alpha}}(4/\epsilon)$.
3: Set $r = \frac{1}{\epsilon\rho^2} \cdot 16\log(n)$.
4: for $r$ times do
5:    Run one realization of a random walk with restart probability $\alpha$: the walk starts
      at $v$ and at each time makes, with probability $\alpha$, a 'termination' step by returning
      to $v$ and terminating, and with probability $1 - \alpha$ a RandomCrawl step. The walk
      is artificially stopped after length steps if it has not terminated already.
6:    if the walk visited a node $u$ just before making a termination step then
7:       Add 1 to the count stored at its entry in NodeCountTree.
8:    end if
9: end for
10: Output all nodes in NodeCountTree together with their average count (over the
    $r$ rounds).

Theorem 3 (Complexity of ApproxRow). For any node $v$ and values
$0 < \epsilon, \rho < 1$, the runtime of ApproxRow($v$, $\epsilon$, $\rho$) is upper bounded by
$O\left(\frac{\log(n)\log(\epsilon^{-1})}{\epsilon\rho^2\log(\frac{1}{1-\alpha})}\right)$.

Proof The algorithm performs $\frac{1}{\epsilon\rho^2} \cdot 16\log(n)$ rounds where at each round it
simulates a random walk with termination probability of $\alpha$ for at most length
steps. Each step is simulated by taking a Jump ('termination' step) with
probability $\alpha$ and taking a RandomCrawl step with probability $1 - \alpha$. Thus the total
number of queries used is $\frac{16\log(n)}{\epsilon\rho^2} \cdot \log_{\frac{1}{1-\alpha}}(4/\epsilon) = O\left(\frac{\log(n)\log(\epsilon^{-1})}{\epsilon\rho^2\log(\frac{1}{1-\alpha})}\right)$. □

Theorem 4 (Correctness of ApproxRow). For any node $v$ and values
$0 < \epsilon, \rho < 1$, with probability of at least $1 - O(1/n)$, ApproxRow($v$, $\epsilon$, $\rho$) computes a
list $l$ with the following properties:

- Every node $j$ that is outputted in the list $l$ has an estimated value which is
non-negative and lies between $(1 - \rho)\text{PPR}(v,j) - \frac{\epsilon}{4}$ and $(1 + \rho)\text{PPR}(v,j)$.

- Every node not in the list $l$ has PPR$(v,j) < \epsilon/2$.

Proof We start with an observation. The personalized PageRank contribution
from a node $v$ to node $j$ is exactly the probability that a random walk that starts
at $v$, and at each time step terminates with probability $\alpha$ and with probability
$1 - \alpha$ moves to a random out-link of the node it is currently at, was at node
$j$ one step before termination. Define $1_v$ to be the indicator vector of $v$. The
proof of the observation follows from a series of algebraic manipulations on the
definition of the PersonalizedPageRank vector of $v$.

$$\text{PersonalizedPageRank}_v(\cdot) = \alpha 1_v + (1 - \alpha)\,\text{PersonalizedPageRank}_v(\cdot) \cdot D^{-1}A.$$

Solving the system gives
$\text{PersonalizedPageRank}_v(\cdot) = \alpha 1_v (I - (1 - \alpha)D^{-1}A)^{-1} = \alpha 1_v \sum_{i=0}^{\infty} ((1 - \alpha)D^{-1}A)^i$.
This last equation makes the observation clear.

Given a node $j$, denote by $p_k(v, j)$ the contribution to $v$'s Personalized PageRank
vector from walks that are of length at most $k$. By the above observation,
$p_k(v,j)$ is the $j$-th entry of $\alpha 1_v \sum_{i=0}^{k} ((1-\alpha)D^{-1}A)^i$.

We ask how much is contributed to $j$'s entry in the Personalized PageRank
vector of $v$ from walks of length bigger or equal to $k$. The contribution is at most
$(1 - \alpha)^k$ since the walk needs to survive at least $k$ consecutive steps. Taking
$(1 - \alpha)^k \le \frac{\epsilon}{4}$ will guarantee that at most $\frac{\epsilon}{4}$ is lost by only considering walks of
length smaller than $k$, namely: $\text{PPR}(v, j) - \frac{\epsilon}{4} \le p_k(v,j) \le \text{PPR}(v, j)$.

For that it suffices to take $k = \log_{\frac{1}{1-\alpha}}(4/\epsilon)$. This is exactly the value that length, the length
of each walk the algorithm simulates, is set to.

Next, the algorithm provides an estimate of $p_k(v,j)$ by realizing walks of length
at most $k$. The algorithm does so by taking the average count over $\frac{1}{\epsilon\rho^2} \cdot 16\log(n)$
trials. Denote the algorithm's output by $\hat{p}_k(v,j)$. Then, if $p_k(v,j) \ge \frac{\epsilon}{4}$, by the
multiplicative Chernoff bound, $\Pr(\hat{p}_k(v,j) > (1 + \rho)p_k(v,j)) \le \exp(-2\log(n))$
and $\Pr(\hat{p}_k(v,j) < (1 - \rho)p_k(v,j)) \le \exp(-2\log(n))$.

We can conclude that $(1 - \rho)(\text{PPR}(v, j) - \frac{\epsilon}{4}) \le \hat{p}_k(v,j) \le (1 + \rho)\text{PPR}(v,j)$.

In particular, nodes with $\text{PPR}(v, j) \ge \epsilon$ will be estimated to a positive value
and outputted as claimed.

Similarly, if $p_k(v,j) \le \frac{\epsilon}{4}$ then by the multiplicative Chernoff bound,
$\Pr(\hat{p}_k(v,j) > \frac{\epsilon}{2}) \le \exp(-2\log(n))$. In this case we conclude that $\hat{p}_k(v,j) \le \frac{\epsilon}{2}$.
Also, $\text{PPR}(v,j) \le \frac{\epsilon}{4} + \frac{\epsilon}{4} \le \frac{\epsilon}{2}$ so $|\text{PPR}(v, j) - \hat{p}_k(v,j)| \le \frac{\epsilon}{2}$. □
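The random-walk estimator analyzed in this proof can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it takes the walk length to be $\log_{1/(1-\alpha)}(4/\epsilon)$ and the number of rounds to be $16\log(n)/(\epsilon\rho^2)$, both rounded up, reads the graph from an adjacency list instead of issuing RandomCrawl queries, and uses a `Counter` in place of a binary search tree. The function name `approx_row` and the 3-cycle example are hypothetical.

```python
import math
import random
from collections import Counter


def approx_row(out_nbrs, v, eps, rho, alpha=0.15, seed=0):
    """Monte Carlo estimate of row v of the PPR matrix from truncated random walks."""
    rng = random.Random(seed)
    n = len(out_nbrs)
    # Assumed parameter choices: walk length and number of rounds, rounded up.
    length = math.ceil(math.log(4.0 / eps) / math.log(1.0 / (1.0 - alpha)))
    rounds = math.ceil(16 * math.log2(n) / (eps * rho ** 2))
    counts = Counter()
    for _ in range(rounds):
        node = v
        for _ in range(length):
            if rng.random() < alpha:
                counts[node] += 1              # walk was at `node` just before terminating
                break
            node = rng.choice(out_nbrs[node])  # RandomCrawl step
        # Walks that survive `length` steps are stopped and counted nowhere.
    return {j: c / rounds for j, c in counts.items()}


# Hypothetical example: directed 3-cycle 0 -> 1 -> 2 -> 0 with alpha = 0.5.
estimate = approx_row([[1], [2], [0]], v=0, eps=0.1, rho=0.5, alpha=0.5, seed=7)
exact = [0.5 / 0.875, 0.25 / 0.875, 0.125 / 0.875]
```

On the 3-cycle the exact values are $\alpha(1-\alpha)^j / (1 - (1-\alpha)^3)$ for $j = 0, 1, 2$, so the estimates can be compared against them directly. Note that the work per call depends only on $\epsilon$, $\rho$, $\alpha$ and $\log n$, not on any node's degree.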
Appendix: Concentration Bounds

Lemma 4 (multiplicative Chernoff bound). Let $X_i$ be i.i.d. Bernoulli random
variables with expectation $\mu$ each. Define $X = \sum_{i=1}^{n} X_i$. Then,

- For $0 < \lambda < 1$, $\Pr[X < (1 - \lambda)\mu n] < \exp(-\mu n\lambda^2/2)$.

- For $0 < \lambda < 1$, $\Pr[X > (1 + \lambda)\mu n] < \exp(-\mu n\lambda^2/4)$.

- For $\lambda \ge 1$, $\Pr[X > (1 + \lambda)\mu n] < \exp(-\mu n\lambda/2)$.
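To see how the lemma is applied to ApproxRow, one can substitute the number of simulated walks into the first bound. The sketch below assumes that the number of trials is $r = 16\log(n)/(\epsilon\rho^2)$ (playing the role of $n$ in the lemma), that each trial succeeds with probability $\mu \ge \epsilon/4$, and that $\lambda = \rho$.

```latex
\[
\frac{\mu r \lambda^2}{2}
\;\ge\; \frac{1}{2}\cdot\frac{\epsilon}{4}\cdot\frac{16\log(n)}{\epsilon\rho^2}\cdot\rho^2
\;=\; 2\log(n),
\qquad\text{so}\qquad
\Pr[X < (1-\rho)\mu r] \;<\; \exp(-2\log(n)).
\]
```

The same substitution into the second bound, with its constant 4 in place of 2, gives an exponent of $\log(n)$ for the upper tail.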



